Fix scalar matching in grapheme semantic mode #565

hamishknight · 2022-07-08T14:06:08Z

Previously we would allow consuming a single scalar of a grapheme in grapheme semantic mode when matching a DSL .scalar. Change this behavior such that it is treated the same as a character with a single scalar. That is:

In grapheme mode it must match an entire grapheme.
In scalar semantic mode, it may match a single scalar.

Additionally, start converting AST .scalar atoms into .scalar DSL atoms. This is both for consistency with the DSL, and allows us to preserve the \u{...} syntax in more cases when a DSL conversion is performed.

Resolves #563
Resolves #564
rdar://96662651

Previously we would allow consuming a single scalar in grapheme semantic mode for a DSL `.scalar`. Change this behavior such that it is treated the same as a character with a single scalar. That is: - In grapheme mode it must match an entire grapheme. - In scalar semantic mode, it may match a single scalar.

Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.

hamishknight · 2022-07-08T14:06:14Z

@swift-ci please test

hamishknight · 2022-07-08T14:26:35Z

Tests/RegexBuilderTests/RegexDSLTests.swift

+    let r3 = Regex {
+      "👨" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👨" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👧" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👦" as UnicodeScalar
+    }
+    XCTAssertNil(try r3.firstMatch(in: "👨‍👨‍👧‍👦"))


Hmm, actually I'm not sure whether this is correct, as we allow matching for e.g:

print("👨‍👨‍👧‍👦".firstMatch(of: /👨\u{200D}👨\u{200D}👧\u{200D}👦/)

should we be combining the scalars in a DSL concatenation the same way we combine them in a literal? And if so, would it also apply to regex scalar children e.g /\u{...}/?

Or is a DSL concatenation semantically more like the following?

print("👨‍👨‍👧‍👦".firstMatch(of: /(?:👨)(?:\u{200D})(?:👨)(?:\u{200D})(?:👧)(?:\u{200D})(?:👦)/)) // nil

The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}.

Yeah makes sense, though we'll need to define exactly when the combining happens, e.g do we combine the following?

Regex { "a" as UnicodeScalar /\u{301}/ }

and would that still happen if there's further contents in the regex? (because this is now a concat within a concat) e.g:

Regex { "a" as UnicodeScalar /\u{301}b/ }

Should Regex {...} calls serve as a boundary between combining scalars? e.g:

Regex { "a" as UnicodeScalar Regex { "\u{301}" as UnicodeScalar "b" } }

rctcwyvrn · 2022-07-08T18:46:40Z

I'm still not convinced that we should have this scalar api while in grapheme semantics at all, converting it to a matching a character under the hood seems unintuitive from a user's perspective. What does a scalar RegexComponent even mean then? If it's given a single scalar character it'll match it as a character and if it's given something that isn't a character (like \u{301}) it'll just never match anything? Why even have this option in grapheme semantic mode then? Seems like strange behaviour to me

Would it be possible for us to instead throw an error if we have a .scalar while in grapheme semantic mode?

hamishknight · 2022-07-08T19:31:53Z

converting it to a matching a character under the hood seems unintuitive from a user's perspective

I don't disagree with this, though I should note that it's equivalent to using a /\u{...}/ regex literal in place of the unicode scalar. IMO the current behavior of potentially splitting graphemes is also somewhat surprising (especially in a character class where some elements could match only a scalar, others could match a character), and isn't something I'd expect in grapheme mode.

Why even have this option in grapheme semantic mode then? Seems like strange behaviour to me

Yeah, I'm starting to think we should maybe consider retracting these scalar APIs on RegexBuilder until we work out these details (there's only 2 of them AFAIK, the RegexComponent conformance + CharacterClass.anyOf overload). I think we'd still want this change to the matching engine tho, as /\u{...}/ should still match a grapheme IMO.

Would it be possible for us to instead throw an error if we have a .scalar while in grapheme semantic mode?

Yeah, that would be a possible option, though it would surface as a runtime crash on the first call to a matching function.

milseman

LGTM, but I think we can clean up the code gen path a little.

milseman · 2022-07-11T14:24:20Z

Sources/_StringProcessing/ByteCodeGen.swift

+      // A scalar always matches the same as a single scalar character. This
+      // means it must match a whole grapheme in grapheme semantic mode, but
+      // can match a single scalar in scalar semantic mode.
+      try emitCharacter(Character(s))


Shouldn't this be conditional on whether we're in grapheme semantics or scalar semantics?

emitCharacter will call emitConsumeScalar for scalar mode

milseman · 2022-07-11T14:27:39Z

Sources/_StringProcessing/ConsumerInterface.swift

+      // A scalar always matches the same as a single scalar character. This
+      // means it must match a whole grapheme in grapheme semantic mode, but
+      // can match a single scalar in scalar semantic mode.
+      return try Character(s).generateConsumer(opts)


This seems pretty convoluted. Why do we not just emit a scalar consuming instruction in scalar semantic mode and set the grapheme boundary bit? CC @rctcwyvrn to see if we need to get some of her instructions in first.

Can we do that within a consumer matching function? I know we eventually want to migrate off consumer functions, but for now this will generate a consumeScalar consumer in scalar semantic mode

This path should only be reached for non-ascii custom character classes so I don't think we can emit an instruction instead of a consume fn in that case. Maybe we could if we made custom character classes emit in the same way we do alternations?

milseman · 2022-07-11T14:30:02Z

Tests/RegexBuilderTests/RegexDSLTests.swift

+    let r3 = Regex {
+      "👨" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👨" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👧" as UnicodeScalar
+      "\u{200D}" as UnicodeScalar
+      "👦" as UnicodeScalar
+    }
+    XCTAssertNil(try r3.firstMatch(in: "👨‍👨‍👧‍👦"))


The scalars should get globbed together into a stream of scalars, over which grapheme breaking can occur. This is similar to scalar escaped inside string literals. For the DSL, I'd treat them like a stream of scalars ala /\u{...}\u{...}\u{...}.

rctcwyvrn · 2022-07-11T18:31:58Z

Edge case I ran into while updating #525

Should a single scalar composed value match the decomposed version while in grapheme mode? Ie "é" as UnicodeScalar

In the scalar optimizations branch I tried to make it match this PR's scalar matching by emitting .scalar as matchScalar + boundary check, but that would fail to match the decomposed version. This PR would match the decomposed version because it would convert it to a character and match by unicode equivalence.

I'm leaning towards that it shouldn't be able to match the decomposed version, as we asked it to match a scalar and it would end up matching a completely different scalar sequence

    let r4 = Regex { "é" as UnicodeScalar }
    XCTAssertNotNil( // nil or not nil?
      try r4.firstMatch(in: "e\u{301}")
    )
    XCTAssertNotNil(
      try r4.firstMatch(in: "é")
    )

hamishknight · 2022-07-11T19:31:04Z

I think I'm leaning towards saying it actually should match the decomposed version. IMO the UnicodeScalar conformance should be semantically equivalent to /\u{...}/, and in grapheme semantic mode IMO it seems reasonable to expect /\u{E9}/ to be the same regex as /e\u{301}/. This would also match with the expected behavior of combining all adjacent scalars such that they effectively become a .quotedLiteral rather than distinct semantic elements.

I do agree that it could potentially be unexpected tho.

rctcwyvrn · 2022-07-11T19:36:32Z

Sources/_StringProcessing/Regex/ASTConversion.swift

@@ -216,7 +216,7 @@ extension AST.Atom {

    switch self.kind {
    case let .char(c):                    return .char(c)
-    case let .scalar(s):                  return .char(Character(s.value))
+    case let .scalar(s):                  return .scalar(s.value)


Do we want the behaviour of regex to be different depending on if we input a unicode value as its scalar code or as a literal character? ie "ợ" vs "\u{1ee3}"? With this one would be interpreted as a .char and the other as a .scalar in the DSLTree. I think the intent for the original code here was to make those two inputs equivalent

I think they should be semantically equivalent (i.e they go through the exact same code paths in byte code emission), but this distinction allows us to have more accurate printing of the DSL tree when performing a literal -> DSL conversion

(if we didn't need the distinction for printing, IMO it would make sense to outright remove .scalar from the DSL tree)

If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.

Also yea, seeing as we're moving towards making a .scalar behave exactly the same as .char, maybe it would make sense to remove it outright

If they behave the same then why would we need to print them differently? Printing them differently would imply that they behave differently I think.

For example, if the user wrote /e\u{301}/, ideally the transformed DSL tree would be:

Regex { "e\u{301}" }

instead of:

Regex { "é" }

Semantically they're the same, but IMO it's much nicer to preserve the original syntax the user wrote.

milseman · 2022-07-11T23:07:56Z

Should a single scalar composed value match the decomposed version while in grapheme mode?

Yes, in grapheme cluster semantics we do Unicode Canonical Equivalence.

rctcwyvrn · 2022-07-11T23:37:42Z

Ok so #525, is ready to merge and up to date with this branch. Should we go ahead and merge that one?

@hamishknight were you planning on doing additional work on combining scalars together?

hamishknight · 2022-07-12T10:22:19Z

@rctcwyvrn I'm happy with landing both, I can work on the scalar combination logic in a follow-up

milseman · 2022-07-12T17:03:44Z

Let's merge this whenever you're ready @rctcwyvrn, and we can coordinate how to land @hamishknight's work on top and cherry-pick things together

hamishknight · 2022-07-12T17:32:27Z

@swift-ci please test

- Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (#565)

rctcwyvrn · 2022-07-12T18:40:08Z

Changes merged in 7752047

) - Adds new instructions for matching characters and scalars case insensitively - Compiles ascii character matches into the faster scalar match instructions even in grapheme semantic mode - Optimizes out unnecessary runtime grapheme boundary checks for all ascii strings - Also includes fixes to scalar matching in grapheme semantic mode (swiftlang#565)

hamishknight added 2 commits July 8, 2022 14:57

Preserve scalar syntax in DSL conversion

c6bc811

Convert AST scalars to DSL scalars such that we can preserve the `\u{...}` syntax where the user chooses to use it. This requires fixing up some escaping logic such that we don't escape `\u{...}` sequences.

hamishknight requested review from milseman and rctcwyvrn July 8, 2022 14:06

hamishknight commented Jul 8, 2022

View reviewed changes

milseman approved these changes Jul 11, 2022

View reviewed changes

rctcwyvrn added a commit to rctcwyvrn/swift-experimental-string-processing that referenced this pull request Jul 11, 2022

Change scalar semantics to match swiftlang#565

bca1d2b

rctcwyvrn reviewed Jul 11, 2022

View reviewed changes

rctcwyvrn closed this Jul 12, 2022

hamishknight deleted the not-to-scale branch July 12, 2022 18:44

rctcwyvrn mentioned this pull request Jul 12, 2022

[5.7] Fix infinite loop and scalar matching in grapheme mode + scalar matching optimizations #569

Merged

Fix scalar matching in grapheme semantic mode #565

Fix scalar matching in grapheme semantic mode #565

Uh oh!

Conversation

hamishknight commented Jul 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hamishknight commented Jul 8, 2022

Uh oh!

hamishknight Jul 8, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn commented Jul 8, 2022

Uh oh!

hamishknight commented Jul 8, 2022

Uh oh!

milseman left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rctcwyvrn commented Jul 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hamishknight commented Jul 11, 2022 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

milseman commented Jul 11, 2022

Uh oh!

rctcwyvrn commented Jul 11, 2022

Uh oh!

hamishknight commented Jul 12, 2022

Uh oh!

milseman commented Jul 12, 2022

Uh oh!

hamishknight commented Jul 12, 2022

Uh oh!

rctcwyvrn commented Jul 12, 2022

Uh oh!

Uh oh!

hamishknight commented Jul 8, 2022 •

edited

Loading

hamishknight Jul 8, 2022 •

edited

Loading

rctcwyvrn commented Jul 11, 2022 •

edited

Loading

hamishknight commented Jul 11, 2022 •

edited

Loading